Topics covered in this notebook:
You can skip ahead in the video recording of this lecture by scanning for the large colored banner announcing the next topic, like the following banner.
import matplotlib.pyplot as plt
n_topics = 8
topic_i = 0
def new_topic(txt):
    global topic_i
    topic_i += 1
    txt = f'\n ({topic_i} of {n_topics})\n\n ' + txt + ' \n'
    font = {'family': 'serif',
            'color': 'darkblue',
            'weight': 'bold',
            'size': 40,
            }
    # plt.figure(figsize=(30, 18))
    plt.axis('off')
    plt.text(0.5, 0.5, txt, ha='center', wrap=True,
             backgroundcolor='lightyellow',
             fontdict=font,
             bbox=dict(facecolor='yellow',
                       edgecolor='blue',
                       linewidth=5))
new_topic('Loading air quality data as pandas DataFrame')
import numpy as np
import matplotlib.pyplot as plt
import pandas # for reading csv file
from IPython.display import display, clear_output # for animations later in this notebook
First, let's find and download some interesting data. The machine learning repository at the University of California, Irvine, is a great resource for publicly available data with explanations for machine learning researchers. Here we download the air quality data set. If curl is not available on your system, you may use the above link to find and download this data. It is useful to go to the link and find the page describing this data set. That page is here.
!curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/00360/AirQualityUCI.zip
!unzip -o AirQualityUCI.zip
We will use the pandas package to read this data. The pandas.read_csv function is extremely useful for reading in all kinds of data with various peculiarities. Here are the first few lines of AirQualityUCI.csv.
!head AirQualityUCI.csv
Notice a few things. Fields are separated by semi-colons. The first line gives names for each variable, appearing in separate columns. Each row is one sample. Each line ends with two semi-colons. Not immediately obvious is that the decimal values follow the European convention of using a comma instead of a decimal point. Not demonstrated in these first few lines is the fact that missing measurements are given the value -200.
All of these issues can be dealt with directly in the call to pandas.read_csv. I don't mean to imply that I got this right on my first try. The two most puzzling issues were the two semi-colons at the end of each line and the commas for decimal points. The double semi-colons caused the data returned by pandas.read_csv to have more columns than I expected.
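To see how these read_csv options handle the file's quirks, here is a tiny synthetic snippet (not the actual data, just mimicking its format: semi-colon separators, decimal commas, -200 for missing values, and two trailing semi-colons per line):

```python
import io
import pandas

# A synthetic snippet mimicking the quirks of AirQualityUCI.csv.
snippet = ('Date;Time;CO(GT);;\n'
           '10/03/2004;18.00.00;2,6;;\n'
           '10/03/2004;19.00.00;-200;;\n')

# decimal=',' handles European decimal commas, na_values=-200 marks missing
# measurements, and usecols drops the empty columns from the trailing ';;'.
df = pandas.read_csv(io.StringIO(snippet), delimiter=';', decimal=',',
                     usecols=range(3), na_values=-200)
print(df)
```

The value 2,6 becomes the float 2.6, and the -200 becomes NaN, ready for dropna.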
Very good pandas tutorials are available, such as Pandas Illustrated: The Definitive Visual Guide to Pandas by Lev Maximov, and Pandas tutorials.
data = pandas.read_csv('AirQualityUCI.csv', delimiter=';', decimal=',', usecols=range(15), na_values=-200)
data = data.dropna(axis=0)
data.shape
So, we have 827 rows and 15 columns of data. This means that we read 827 samples that do not have missing values, and each sample contains 15 values. Let's look at the first few rows of this data matrix, called a DataFrame in pandas.
data.head(10)
Let's create a simple problem for playing with this data. Let's say we want to predict the level of carbon monoxide from the time of day. The column Time contains the hour, but not just the hour. 9am will appear as 09.00.00. Whoopee! This will give us a chance to practice our skills at extracting substrings, converting strings to integers, and doing these steps for all of the Time values within a concise little list comprehension. You don't know what this is? Well, it is time to get comfortable not knowing, and typing 'python list comprehension' into your favorite web search engine.
data['Time'][:10]
[t for t in data['Time'][:10]]
[t[:2] for t in data['Time'][:10]]
[int(t[:2]) for t in data['Time'][:10]]
hour = [int(t[:2]) for t in data['Time']]
len(hour)
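By the way, pandas can do the same extraction without an explicit loop, using the .str accessor on a Series. A minimal sketch, assuming times formatted like '09.00.00' as in this data set:

```python
import pandas

# The same extraction as the list comprehension, vectorized with pandas.
# (A small made-up Series stands in for data['Time'] here.)
times = pandas.Series(['09.00.00', '18.00.00', '23.00.00'])
hours = times.str[:2].astype(int)   # slice the first two characters, convert to int
print(list(hours))                  # [9, 18, 23]
```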
To get the carbon monoxide measurements for each sample, you can read the data description at the UCI web site to learn that column CO(GT) is the ground truth measurement of carbon monoxide.
data.columns
CO = data['CO(GT)']
CO[:10]
new_topic('Extracting data we want from the DataFrame, into $X$ and $T$')
Here I will introduce a convention I will follow throughout this class. Inputs to a model are given in a matrix named X. Samples are in rows, and the components, measurements, variables, thingies of each sample are given in the columns. The desired, correct outputs for each sample are given in a matrix named T, for Targets. The $i^{th}$ row of X is Sample $i$, whose correct target output is in row $i$ of T. Yes, you excellent software developers, X and T are parallel arrays, which should set off alarms in your coding brains. As long as we remember that we cannot reorder the rows in X without doing the same reordering of the rows in T, we will be okay.
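If you ever do need to reorder, the safe habit is to generate one permutation and apply it to both arrays. A minimal sketch on made-up data:

```python
import numpy as np

# Keep parallel arrays aligned: generate ONE permutation, apply it to both.
rng = np.random.default_rng(0)
X_demo = np.arange(5).reshape(-1, 1)   # inputs 0..4, one column
T_demo = X_demo * 10                   # target for row i is 10 * X_demo[i]

perm = rng.permutation(len(X_demo))
X_shuf, T_shuf = X_demo[perm], T_demo[perm]

# Rows still correspond after shuffling:
print(np.all(T_shuf == X_shuf * 10))   # True
```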
Let's set this up for our hour to CO problem.
T = CO
T = np.array(T).reshape((-1, 1)) # make T have one column and as many rows as needed to hold the values of T
Tnames = ['CO']
X = np.array(hour).reshape((-1, 1))
Xnames = ['Hour']
print('X.shape =', X.shape, 'Xnames =', Xnames, 'T.shape =', T.shape, 'Tnames =', Tnames)
# or, using the latest formatting ability in python strings,
print(f'{X.shape=} {Xnames=} {T.shape=} {Tnames=}')
Say after me...plot early and often! We can never have too many visualizations. This next plot verifies that we have defined X and T correctly. What else do you notice?
plt.plot(X, T)
Whoa! That's a mess. Why?
Let's just plot a circle marker on each data point. And, never, never, forget to label the $x$ and $y$ axes.
plt.plot(X, T, '.')
plt.xlabel(Xnames[0])
plt.ylabel(Tnames[0]); # semi-colon here prevents printing the cryptic result of call to plt.ylabel()
Well, what do you think? Will we be able to predict CO from Hour with a linear model? The predictions of a linear model must appear as a straight line in this plot.
new_topic('A linear model')
What is a linear model?
Well, a linear model of one variable is specified with a y-intercept and a slope. These are the two parameters of the linear model. Let's call them w0 and w1. If the output of the linear model is y, then we have y = w0 + x * w1. Latex makes a nice mathy representation of this.

$$ y = w_0 + x w_1 $$
Let's wrap this up in a little function.
def linear_model(x, w0, w1):
    return w0 + x * w1
So, what values should we use for w0 and w1 to make a good prediction of CO? What method shall we use to find good values? How about good old guessing, or maybe we could call this trial and error. We will just pick some values and plot the predictions.
new_topic('Optimizing weights of linear model with manual guessing')
w0 = 0
w1 = 1
Y = linear_model(X, w0, w1)
plt.plot(X, T, '.', label='Actual CO')
plt.plot(X, Y, 'r.-', label='Predicted CO')
plt.xlabel(Xnames[0])
plt.ylabel(Tnames[0])
plt.legend(); # make legend using the label strings
Well, clearly our predictions are climbing much too quickly. The slope, or w1, is too high. Try a smaller value.
w1 = 0.1
Y = linear_model(X, w0, w1)
plt.plot(X, T, '.', label='Actual CO')
plt.plot(X, Y, 'r.-', label='Predicted CO')
plt.xlabel(Xnames[0])
plt.ylabel(Tnames[0])
plt.legend();
Maybe too low.
w1 = 0.3
Y = linear_model(X, w0, w1)
plt.plot(X, T, '.', label='Actual CO')
plt.plot(X, Y, 'r.-', label='Predicted CO')
plt.xlabel(Xnames[0])
plt.ylabel(Tnames[0])
plt.legend();
Okay. Now we can try to increase the y-intercept, w0, a bit.
w0 = 0.4
Y = linear_model(X, w0, w1)
plt.plot(X, T, '.', label='Actual CO')
plt.plot(X, Y, 'r.-', label='Predicted CO')
plt.xlabel(Xnames[0])
plt.ylabel(Tnames[0])
plt.legend();
We could do this all day. (Well, maybe a few more times.)
What we really need is a way to quantify how well our linear model is doing. Let's define a function to calculate the error by calling linear_model to get our predictions, Y, and comparing them to the target T values. The comparison will be done with the common root-mean-square error, or RMSE, approach, for which the difference between T and Y is squared, averaged, and the square root of the result is returned.
def rmse(X, T, w0, w1):
    Y = linear_model(X, w0, w1)
    return np.sqrt(np.mean((T - Y)**2))
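A quick sanity check helps confirm rmse behaves as expected: on synthetic data that lies exactly on a line, the correct weights should give zero error. (The cell redefines the two functions so it is self-contained.)

```python
import numpy as np

def linear_model(x, w0, w1):
    return w0 + x * w1

def rmse(X, T, w0, w1):
    Y = linear_model(X, w0, w1)
    return np.sqrt(np.mean((T - Y)**2))

# Synthetic data lying exactly on the line y = 0.4 + 0.3 x.
X_check = np.arange(10).reshape(-1, 1)
T_check = 0.4 + 0.3 * X_check

print(rmse(X_check, T_check, 0.4, 0.3))       # 0.0, correct weights give zero error
print(rmse(X_check, T_check, 0.4, 0.0) > 0)   # True, a wrong slope gives positive error
```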
Now we are ready to automate the guessing approach we attempted above. Let's define a search algorithm that adjusts one weight at a time: try a small step up and down in one weight, keep stepping in whichever direction decreases the error, then switch to the other weight and repeat. This search algorithm is sometimes called coordinate descent.
new_topic('Optimizing weights with Coordinate Descent')
First, let's modify w0.
w0 = 0.4 # Initial guess at weight values
w1 = 0.5
dw = 0.1 # How much to change a weight's value on each step.
current_error = rmse(X, T, w0, w1)
up_error = rmse(X, T, w0 + dw, w1)
down_error = rmse(X, T, w0 - dw, w1)
if down_error < current_error:
    dw = -dw
    new_error = down_error
else:
    new_error = up_error

while new_error <= current_error:
    current_error = new_error
    w0 = w0 + dw
    new_error = rmse(X, T, w0, w1)
    print(f'{w0=:5.2f} {new_error=:.5f}')
Now let's modify $w_1$.
dw = 0.1
current_error = rmse(X, T, w0, w1)
up_error = rmse(X, T, w0, w1 + dw)
down_error = rmse(X, T, w0, w1 - dw)
if down_error < current_error:
    dw = -dw
    new_error = down_error
else:
    new_error = up_error

while new_error <= current_error:
    current_error = new_error
    w1 = w1 + dw
    new_error = rmse(X, T, w0, w1)
    print('w1 = {:.2f} new_error = {:.5f}'.format(w1, new_error))
Lots of repeated code here. We don't want to copy and paste for each iteration. All of the steps are put together in the following function. Let's collect the RMSE after each update in a list named error_sequence, and also the values of w0 and w1 after each update in a list named W_sequence.
def coordinate_descent(errorF, X, T, w0, w1, dw, nSteps):
    step = 0
    current_error = errorF(X, T, w0, w1)
    error_sequence = [current_error]
    W_sequence = [[w0, w1]]
    changed = False
    while step < nSteps:
        step += 1
        if not changed:
            dw = dw * 0.1
        changed = False

        # Iteratively update w0, until no improvement.
        up_error = errorF(X, T, w0 + dw, w1)
        down_error = errorF(X, T, w0 - dw, w1)
        if down_error < current_error:
            dw = -dw
        while True:
            new_w0 = w0 + dw
            new_error = errorF(X, T, new_w0, w1)
            if new_error >= current_error or step > nSteps:
                break
            changed = True
            w0 = new_w0
            W_sequence.append([w0, w1])
            error_sequence.append(new_error)
            current_error = new_error
            step += 1

        # Now iteratively update w1, until no improvement.
        up_error = errorF(X, T, w0, w1 + dw)
        down_error = errorF(X, T, w0, w1 - dw)
        if down_error < current_error:
            dw = -dw
        while True:
            new_w1 = w1 + dw
            new_error = errorF(X, T, w0, new_w1)
            if new_error >= current_error or step > nSteps:
                break
            changed = True
            w1 = new_w1
            W_sequence.append([w0, w1])
            error_sequence.append(new_error)
            current_error = new_error
            step += 1

    # When nSteps have been taken, return the two weights and the two lists of sequences.
    return w0, w1, error_sequence, W_sequence
We will need some functions to help us create plots showing the error going down and the sequence of weight values that were tried. Read through this code. It will be very helpful to fully understand this code.
def plot_sequence(error_sequence, W_sequence, label):
    plt.subplot(1, 2, 1)
    plt.plot(error_sequence, 'o-', label=label)
    plt.xlabel('Steps')
    plt.ylabel('Error')
    plt.legend()
    plt.subplot(1, 2, 2)
    W_sequence = np.array(W_sequence)
    plt.plot(W_sequence[:, 0], W_sequence[:, 1], '.-', label=label)
    plot_error_surface()
def plot_error_surface():
    n = 20
    w0s = np.linspace(-5, 5, n)
    w1s = np.linspace(-0.5, 1.0, n)
    w0s, w1s = np.meshgrid(w0s, w1s)
    surface = []
    for w0i in range(n):
        for w1i in range(n):
            surface.append(rmse(X, T, w0s[w0i, w1i], w1s[w0i, w1i]))
    plt.contourf(w0s, w1s, np.array(surface).reshape((n, n)), cmap='bone')
    # plt.colorbar()
    plt.xlabel('w_bias')
    plt.ylabel('w')
def show_animation(model, error_sequence, W_sequence, X, T, label):
    W_sequence = np.array(W_sequence)
    fig = plt.figure(figsize=(15, 8))
    plt.subplot(1, 3, 1)
    error_line, = plt.plot([], [])
    plt.xlim(0, len(error_sequence))
    plt.ylim(0, max(error_sequence))
    plt.subplot(1, 3, 2)
    plot_error_surface()
    w_line, = plt.plot([], [], 'y.-', label=label)
    plt.legend()
    plt.subplot(1, 3, 3)
    plt.plot(X, T, 'o')
    model_line, = plt.plot([], [], 'r-', lw=3, alpha=0.5, label=label)
    plt.xlim(0, 24)
    plt.ylim(np.min(T), np.max(T))
    for i in range(len(W_sequence)):
        error_line.set_data(range(i), error_sequence[:i])
        w_line.set_data(W_sequence[:i, 0], W_sequence[:i, 1])
        Y = model(X, W_sequence[i, 0], W_sequence[i, 1])
        model_line.set_data(X, Y)
        # plt.pause(0.001)
        clear_output(wait=True)
        display(fig)
    clear_output(wait=True)
Now let's try these functions to illustrate coordinate descent, given the initial values of w0 and w1, and the parameter values that control the optimization, nSteps and dw.
w0 = -2
w1 = 0.5
nSteps = 200
dw = 10
w0, w1, error_sequence, W_sequence = coordinate_descent(rmse, X, T, w0, w1, dw, nSteps)
print(f'Coordinate Descent: Error is {rmse(X, T, w0, w1):.2f} W is {w0:.2f}, {w1:.2f}')
Well, did we succeed? Hard to know from these three numbers. Let's plot stuff to get a better understanding.
show_animation(linear_model, error_sequence, W_sequence, X, T, 'coord desc')
Okay, well that's fun, but this becomes kind of silly when we try to apply this to other models that have more weights, like thousands, or millions. Instead, we need a way to find a direction in which we can change both weights, meaning all weights, on each step.
new_topic('Optimizing weights with Run and Twiddle')
How about this? Take a step in some direction. If error decreases, continue in that direction. If error does not decrease, pick a random direction. Repeat.
This has been called "run and twiddle", or "run and tumble". This Wikipedia page describes how single cell organisms use cilia in their cell membranes to provide locomotion, either in the current direction as they move in a coordinated fashion, or to cause a spin to change direction.
Now we will be changing w0 and w1 together, so that we can step through the two-dimensional weight space in various directions. Representing our two weights as a two-component vector, and rewriting some functions to accept a vector, simplifies the code a bit. We will actually represent the weights, W, as a column matrix of two components. It will look like

$$ W = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} $$
Let's add a couple of functions to code the application of our model to data, and to calculate the RMSE that we wish to minimize. Then we will modify a bit the plotting functions to accept the name of the model function. We will use a different model towards the end of these notes.
def linear_model(X, W):
    # W is a column vector
    return W[0, :] + X @ W[1:, :]

def rmse(model, X, T, W):
    Y = model(X, W)
    return np.sqrt(np.mean((T - Y)**2))

def vector_length(v):
    return np.sqrt(v.T @ v)
def run_and_twiddle(model_f, rmse_f, X, T, W, dW, nSteps, verbose=False):
    step = 0
    current_error = rmse_f(model_f, X, T, W)
    error_sequence = [current_error]
    W_sequence = [W.flatten()]
    nFails = 0
    while step < nSteps:
        new_direction = np.random.uniform(-1, 1, size=(2, 1))
        if verbose:
            print(f'{step=} {nFails=} {new_direction=}')
        new_direction = dW * new_direction / vector_length(new_direction)
        if nFails > 10:
            dW = dW * 0.8
        while step < nSteps:
            new_W = W.copy() + new_direction  # Why call copy() here?
            new_error = rmse_f(model_f, X, T, new_W)
            step += 1
            if new_error >= current_error:
                nFails += 1
                break
            nFails = 0
            if verbose:
                print(f'good direction {new_direction=}')
            W = new_W
            W_sequence.append(W.flatten())
            error_sequence.append(new_error)
            current_error = new_error
    return W, error_sequence, W_sequence
def plot_error_surface(model):
    n = 20
    wbiass = np.linspace(-5, 5, n)
    ws = np.linspace(-0.5, 1.0, n)
    wbiass, ws = np.meshgrid(wbiass, ws)
    surface = []
    for wbi in range(n):
        for wi in range(n):
            W = np.array([wbiass[wbi, wi], ws[wbi, wi]]).reshape(-1, 1)
            surface.append(rmse(model, X, T, W))
    plt.contourf(wbiass, ws, np.array(surface).reshape((n, n)), cmap='bone')
    # plt.colorbar()
    plt.xlabel('w_bias')
    plt.ylabel('w')
def show_animation(model, error_sequence, W_sequence, X, T, label):
    W_sequence = np.array(W_sequence)
    fig = plt.figure(figsize=(15, 8))
    plt.subplot(1, 3, 1)
    error_line, = plt.plot([], [])
    plt.xlim(0, len(error_sequence))
    plt.ylim(0, max(error_sequence))
    plt.subplot(1, 3, 2)
    plot_error_surface(model)
    w_line, = plt.plot([], [], 'y.-', label=label)
    plt.legend()
    plt.subplot(1, 3, 3)
    plt.plot(X, T, 'o')
    model_line, = plt.plot([], [], 'r-', lw=3, alpha=0.5, label=label)
    plt.xlim(0, 24)
    plt.ylim(np.min(T), np.max(T))
    for i in range(len(W_sequence)):
        error_line.set_data(range(i), error_sequence[:i])
        w_line.set_data(W_sequence[:i, 0], W_sequence[:i, 1])
        Y = model(X, W_sequence[i:i + 1, :].T)
        model_line.set_data(X, Y)
        # plt.pause(0.001)
        clear_output(wait=True)
        display(fig)
w0 = -2
w1 = 0.5
W = np.array([w0, w1]).reshape(-1, 1)
nSteps = 400
dW = 10
W, error_sequence, W_sequence = run_and_twiddle(linear_model, rmse, X, T, W, dW, nSteps)
print('Run and Twiddle: Error is {:.2f} W is {:.2f}, {:.2f}'.format(rmse(linear_model, X, T, W), W[0,0], W[1,0]))
show_animation(linear_model, error_sequence, W_sequence, X, T, 'run & twiddle')
new_topic('Optimizing weights with Stochastic Gradient Descent')
Let's call the output of our model Y and the error being minimized E.

To perform gradient descent, we need $\frac{\partial E}{\partial W}$. Let's call this dEdW. The calculation of this can be divided into two factors, using the chain rule.
The error we want to minimize is the squared error, $(T - Y)^2$, and $Y =X W$, so
$$\begin{align*} \frac{\partial E}{\partial W} &= \frac{\partial E}{\partial Y} \frac{\partial Y}{\partial W} \\ \frac{\partial E}{\partial W} &= \frac{\partial (T-Y)^2}{\partial Y} \frac{\partial X W}{\partial W} \\ \frac{\partial E}{\partial W} &= -2 (T - Y) X \end{align*}$$

In python, we have

dYdW = X
dEdY = -2 * (T - Y)
dEdW = dEdY.T @ dYdW
with some other subtle things to allow us to include the bias weight $w_0$ in the calculations.
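The main "subtle thing" is prepending a column of constant 1s to the inputs, so the bias weight $w_0$ multiplies an input of 1 just like the other weights multiply their inputs. A small demonstration of np.insert doing this:

```python
import numpy as np

# Prepend a column of ones so the bias weight has a constant input of 1.
X_small = np.array([[2.0],
                    [5.0],
                    [7.0]])
X_with_bias = np.insert(X_small, 0, 1, axis=1)
print(X_with_bias)
# [[1. 2.]
#  [1. 5.]
#  [1. 7.]]
```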
# Still using linear_model as defined above
# def linear_model(X, W):
#     # W is a column vector
#     return W[0, :] + X @ W[1:, :]

# Gradient of Y with respect to W
def dYdW(X, T, W):
    # One row per sample in X, T. One column per W component.
    # For the first column (the bias weight), the input is the constant 1.
    # For the second column, it is the value of X.
    return np.insert(X, 0, 1, axis=1)

# Gradient of E with respect to Y
def dEdY(X, T, W):
    Y = linear_model(X, W)
    return -2 * (T - Y)

# Gradient of E with respect to W
def dEdW(X, T, W):
    result = dEdY(X, T, W).T @ dYdW(X, T, W) / X.shape[0]
    return result.T
Now we can define a function to optimize the weights using stochastic gradient descent. We will call this sgd, since this optimization method is often called SGD.
def sgd(model_f, gradient_f, rmse_f, X, T, W, learning_rate, nSteps):
    error_sequence = []
    W_sequence = []
    for step in range(nSteps):
        error_sequence.append(rmse_f(model_f, X, T, W))
        W_sequence.append(W.flatten())  # or W.ravel()
        W -= learning_rate * gradient_f(X, T, W)  # HERE IS THE WHOLE ALGORITHM!!
    return W, error_sequence, W_sequence
w0 = -2
w1 = 0.5
W = np.array([w0, w1]).reshape(-1, 1)
nSteps = 200
learning_rate = 0.005
W, error_sequence, W_sequence = sgd(linear_model, dEdW, rmse, X, T, W, learning_rate, nSteps)
print('Gradient Descent: Error is {:.2f} W is {}'.format(rmse(linear_model, X, T, W), W))
show_animation(linear_model, error_sequence, W_sequence, X, T, 'SGD')
new_topic('Optimizing weights with AdamW')
Now let's try a more recently developed variation of the gradient descent method, called Adam, for adaptive moment estimation. See ADAM: A Method for Stochastic Optimization by Diederik P. Kingma and Jimmy Lei Ba. We will implement the AdamW variant, which adds decoupled weight decay, as described in Decoupled Weight Decay Regularization by Ilya Loshchilov and Frank Hutter.
def adamw(model_f, gradient_f, rmse_f, X, T, W, learning_rate, nSteps):
    # Commonly used parameter values
    alpha = learning_rate
    beta1 = 0.9
    beta2 = 0.999
    epsilon = 1e-8
    W_decay_rate = 0.1  # set to zero for the plain adam algorithm
    m = 0
    v = 0
    error_sequence = []
    W_sequence = []
    for step in range(nSteps):
        error_sequence.append(rmse_f(model_f, X, T, W))
        W_sequence.append(W.flatten())
        grad = gradient_f(X, T, W)
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad * grad
        mhat = m / (1 - beta1 ** (step + 1))
        vhat = v / (1 - beta2 ** (step + 1))
        W -= alpha * mhat / (np.sqrt(vhat) + epsilon) + W_decay_rate * W
    return W, error_sequence, W_sequence
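The mhat and vhat lines are Adam's bias corrections: since m and v start at zero, the raw moving averages underestimate the true moments early on. A quick arithmetic check of the first step, on a made-up gradient:

```python
import numpy as np

# On the first step, m = beta1 * 0 + (1 - beta1) * grad, so dividing by
# (1 - beta1**1) recovers grad exactly, undoing the zero-initialization bias.
beta1 = 0.9
grad = np.array([[2.0], [-3.0]])   # a made-up gradient for illustration
m = beta1 * 0 + (1 - beta1) * grad
mhat = m / (1 - beta1 ** 1)
print(np.allclose(mhat, grad))     # True
```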
w0 = -2
w1 = 0.5
W = np.array([w0, w1]).reshape(-1, 1)
nSteps = 200
learning_rate = 0.02
W, error_sequence, W_sequence = adamw(linear_model, dEdW, rmse, X, T, W, learning_rate, nSteps)
print('Adam: Error is {:.2f} W is {:.2f}, {:.2f}'.format(rmse(linear_model, X, T, W), W[0,0], W[1,0]))
show_animation(linear_model, error_sequence, W_sequence, X, T, 'AdamW')
new_topic('Let\'s Try a Slightly Nonlinear Model')
Let's make a quadratic model now. The equation for the output of this model for the $i^{th}$ sample is
$$ y_i = w_0 + w_1 x_i + w_2 x_i^2 $$

X[:10]
X[:10] ** 2
X[:10] ** [1, 2]
max_degree = 2
X[:10] ** [0, 1, 2]
range(max_degree)
list(range(max_degree))
list(range(max_degree + 1))
X_powers = X ** range(max_degree + 1)
X_powers[:10]
To build our nonlinear model, let's remove the addition of the w0 weight. Instead, we will add an input variable (a column of X) that is a constant 1 for all samples (rows of X).
def nonlinear_model(X_powers, W):
    # W is a column vector
    return X_powers @ W

def dYdW(X_powers, T, W):
    return X_powers

def dEdY(X_powers, T, W):
    Y = nonlinear_model(X_powers, W)
    return -2 * (T - Y)
# dEdW from before does not need to be changed.
max_degree = 2
w0 = 0
w1 = 0
ws_nonlinear = np.zeros(max_degree - 1)
W = np.hstack((w0, w1, *ws_nonlinear)).reshape(-1, 1)
print(f'{W=}')
learning_rate = 0.01
nSteps = 400
X_powers = X ** range(max_degree + 1)
print(X_powers.shape)
W, error_sequence, W_sequence = adamw(nonlinear_model, dEdW, rmse, X_powers, T, W, learning_rate, nSteps)
print(f'Adam: Error is {rmse(nonlinear_model, X_powers, T, W):.2f} W is {W}')
Demonstrate debugging here
plt.plot(error_sequence);
plt.figure(figsize=(10,8))
plt.plot(X + np.random.uniform(-0.1, 0.1, X.shape), T, '.', label='Training Data')
plt.plot(X, nonlinear_model(X_powers, W), 'ro', label='Prediction on Training Data')
plt.xlabel(Xnames[0])
plt.ylabel(Tnames[0])
plt.legend();
How would you change the previous code cell to plot a continuous red line?
plt.figure(figsize=(10,8))
plt.plot(X + np.random.uniform(-0.1, 0.1, X.shape), T, '.', label='Training Data')
plt.plot(X, nonlinear_model(X_powers, W), 'r', label='Prediction on Training Data')
# only change is here --------------------^
plt.xlabel(Xnames[0])
plt.ylabel(Tnames[0])
plt.legend();
Whoops.
plt.figure(figsize=(10,8))
plt.plot(X, T, '.', label='Training Data')
order = np.argsort(X, axis=0).ravel() # change to 1-dimensional vector
plt.plot(X[order], nonlinear_model(X_powers, W)[order], 'r', label='Prediction on Training Data')
plt.xlabel(Xnames[0])
plt.ylabel(Tnames[0])
plt.legend();